We present a simple physical model which demonstrates that 
the native state folds of proteins can emerge on the basis of con- 
siderations of geometry and symmetry. We show that the inherent 
anisotropy of a chain molecule, the geometrical and energetic con- 
straints placed by the hydrogen bonds and sterics, and hydrophobicity 
are sufficient to yield a free energy landscape with broad minima even 
for a homopolymer. These minima correspond to marginally com- 
pact structures comprising the menu of folds that proteins choose 
from to house their native-states in. Our results provide a general 
framework for understanding the common characteristics of globular 
proteins. 

Protein folding is complex because of the sheer size of protein
molecules, the twenty types of constituent amino acids with distinct side
chains and the essential role played by the environment. Nevertheless,
proteins fold into a limited number of evolutionarily conserved structures.
It is a familiar, yet remarkable, consequence of symmetry and geometry that 
ordinary matter crystallizes in a limited number of distinct forms. Indeed, 
crystalline structures transcend the specifics of the various entities housed in 
them. Here we ask the analogous question: is the menu of protein folds
also determined by geometry and symmetry? 

We show that a simple model which encapsulates a few general attributes 
common to all polypeptide chains, such as steric constraints, hydrogen
bonding and hydrophobicity, gives rise to the emergent free energy
landscape of globular proteins. The relatively few minima in the resulting 
landscape correspond to putative marginally-compact native-state structures 
of proteins, which are assemblies of helices, hairpins and planar sheets. A 
superior fit of a given protein or sequence of amino acids to one of these
pre-determined folds dictates the choice of the topology of its native-state 
structure. Instead of each sequence shaping its own free energy landscape, 
we find that the overarching principles of geometry and symmetry determine 
the menu of possible folds that the sequence can choose from. 

Following Bernal, the protein problem can be divided into two distinct
steps: first, analogous to the elucidation of crystal structures, one must iden- 
tify the essential features that account for the common characteristics of 
all proteins; second, one must understand what makes one protein different 
from another. Guided by recent work which has shown that a faithful
description of a chain molecule is a tube and using information from known 
protein native state structures, our focus, in this paper, is on the first step 
- we demonstrate that the native-state folds of proteins emerge from consid- 
erations of symmetry and geometry within the context of a simple model. 

We model a protein as a chain of identical amino acids, represented by 
their Cα atoms, lying along the axis of a self-avoiding flexible tube. The
preferential parallel placement of nearby tube segments approximately mimics
the effects of the anisotropic interaction of hydrogen bonds, while the space 
needed for the clash-free packing of side chains is approximately captured 
by the non-zero tube thickness. Here we carefully incorporate these key
geometrical features via an extensive statistical analysis of experimentally 
determined native state structures in the Protein Data Bank (PDB). 

A tube description places constraints on the radii of circles drawn through 
both local and non-local triplets of Cα positions of a protein native structure.
Furthermore, when one deals with a chain molecule, the tube picture under- 
scores the crucial importance of knowing the context that an amino acid is 
in within the chain. The standard coarse-grained approach considers the 
locations of interacting amino acid pairs. Here, instead, we incorporate the 
strongly directional hydrogen bonding between a pair of amino acids, through 
an analysis of the PDB to determine the constraints on the mutual orienta- 
tion of the local coordinate systems defined from a knowledge of the locations 
of the Cα atoms (see Methods and Figure 1). The geometrical constraints
associated with the tube and hydrogen bonds, that we consider here, are rep- 
resentative of the typical aspecific behavior of the interacting amino acids. 

There are two other ingredients in the model: a local bending penalty 
which is related to the steric hindrance of the amino acid side chains and 
a pair-wise interaction of the standard type mediated by the water. Even
though these two properties clearly depend on the specific amino acids
involved in the interaction, here we choose to study the phase diagram of a
homopeptide chain by varying its overall hydrophobicity and local bending
penalty, while keeping them constant along the chain. This is the simplest 
and most general way to assess their relevance in shaping the free-energy 
landscape. 

Methods 

Tube geometry. The protein backbone is modeled as a chain of Cα atoms
(Figure 2a) with a fixed distance of 3.8 Å between successive atoms along the
chain, an excellent assumption for all but cis-proline residues [24]. The
geometry imposed by chemistry dictates that the bond angle associated with
three consecutive Cα atoms is between 82° and 148°.

Self-avoiding conformations of the tube whose axis is the protein backbone 
are identified by considering all triplets of Cα atoms and drawing circles
through them and ensuring that none of their radii is smaller than the tube
radius (Figure 2a). At the local level, the three-body constraint ensures
that a flexible tube cannot have a radius of curvature any smaller than the 
tube thickness in order to prevent sharp corners whereas, at the non-local 
level, it does not permit any self-intersections. There is an inherent local 
anisotropy due to the special direction singled out by consecutive atoms along 
the chain which enforces a preference for parallel alignment of neighboring 
tube segments in a compact conformation. 
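
As an illustration of the three-body constraint described above, the
following minimal Python sketch computes the radius of the circle through
three points and checks that no triplet radius falls below the tube radius.
The function names and the brute-force loop over all triplets are assumptions
made for illustration; they are not taken from the simulation code used in
this work.

```python
import math
from itertools import combinations


def circumradius(p, q, r):
    """Radius of the circle through three 3D points (infinite if collinear)."""
    a = math.dist(p, q)
    b = math.dist(q, r)
    c = math.dist(p, r)
    s = 0.5 * (a + b + c)
    # Heron's formula gives the squared triangle area from the side lengths.
    area_sq = s * (s - a) * (s - b) * (s - c)
    if area_sq <= 1e-12:
        return math.inf  # collinear points: a straight segment has no curvature
    return a * b * c / (4.0 * math.sqrt(area_sq))


def satisfies_tube_constraint(coords, tube_radius=2.5):
    """True if every triplet of points has a circle radius >= tube_radius.

    Consecutive triplets give the local radius of curvature; all other
    triplets test for non-local self-intersection of the tube.
    """
    return all(
        circumradius(coords[i], coords[j], coords[k]) >= tube_radius
        for i, j, k in combinations(range(len(coords)), 3)
    )
```

For a chain of N atoms this check visits all N(N-1)(N-2)/6 triplets, which
remains affordable for short chains such as the 24-atom chains studied here.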

The backbone of Cα atoms is treated as a flexible tube of radius 2.5 Å,
a constraint imposed on all (local and non-local) three-body radii, an
assumption validated for protein native structures. It is interesting to note
that recent observations of residual dipolar couplings in short peptides in
the denatured state have demonstrated their stiffness and their anisotropic 
deformability - the building blocks of proteins are relatively stiff segments 
with strong directional preferences. 

Sterics. Steric constraints require that no two non-adjacent Cα atoms are
allowed to be at a distance closer than 4 Å. Ramachandran and Sasisekharan
showed that steric considerations based on a hard sphere model lead to
clustering of the backbone dihedral angles in two distinct α and β regions for
non-glycyl and non-prolyl residues. The two backbone geometries that allow
for systematic and extensive hydrogen bonding are the α-helix and
the β-sheet, obtained by a repetition of the backbone dihedral angles from
the two regions respectively. Short chains rich in alanine residues, which
are a good approximation to a stretch of the backbone, can adopt a helical
conformation in water (see the cited literature for a detailed discussion of
experimental conditions that would lead to a helical conformation). However,
when one has more heterogeneous side chains, the helix backbone could
sterically clash with some side chain conformers resulting in a loss of
conformational entropy. When the price in side chain entropy is too large,
an extended backbone conformation results, pushing the segment towards a
β-strand structure. These steric constraints are approximately imposed
through an energy penalty (denoted by e_R) when the local radius of
curvature is between 2.5 Å and 3.2 Å. (The magnitude of the penalty does not
depend on the specific value of the radius of curvature provided it is between
these values.) There is no cost when the local radius exceeds 3.2 Å. Note that
the tube constraint does not permit any local radius of curvature to take on
a value less than the tube radius, 2.5 Å.
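
The bending penalty just described is a step function of the local radius of
curvature. A minimal Python sketch is given below; the function name and the
choice to raise an error for radii below the tube radius are illustrative
assumptions, while the thresholds 2.5 Å and 3.2 Å are those quoted in the text.

```python
def bending_energy(local_radii, e_R, r_tube=2.5, r_free=3.2):
    """Total curvature penalty for a list of local radii of curvature (in Å).

    Each radius in [r_tube, r_free) costs a flat e_R, independent of its
    exact value; radii of r_free or more cost nothing. Radii below r_tube
    violate the tube constraint and are rejected (an illustrative choice).
    """
    total = 0.0
    for r in local_radii:
        if r < r_tube:
            raise ValueError("local radius below the tube radius is forbidden")
        if r < r_free:
            total += e_R
    return total
```

For example, with e_R = 0.3 a chain whose local radii are 2.6, 3.0, 3.5 and
10.0 Å pays the penalty twice.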

Hydrogen bonds. We do not allow more than two hydrogen bonds to form
at a given Cα location. In our representation of the protein backbone, local
hydrogen bonds form between Cα atoms separated by two residues along the
sequence with an energy defined to be -1 unit, whereas non-local hydrogen
bonds are those that form between Cα atoms separated by more than three
residues along the sequence with an energy of -0.7. This energy difference is
based on experimental findings that the local bonds provide more stability to
a protein than do the non-local hydrogen bonds. Cooperativity effects
are taken into account by adding an energy of -0.3 units when consecutive
hydrogen bonds along the sequence are formed. There is some latitude in the 
choice of the values of these energy parameters. The results that we present 
are robust to changes (at least of the order of 20%) in these parameters. 
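
The energy bookkeeping for hydrogen bonds can be sketched as follows. This is
a hedged illustration only: it assumes the geometrical criteria for each bond
have already been checked elsewhere, it takes "consecutive" bonds to mean the
pairs (i, j) and (i ± 1, j ± 1) as listed in Table 1, and it ignores the
special rules for chain termini and the β-sheet zig-zag condition. The
function name and the bond-list representation are hypothetical.

```python
def hydrogen_bond_energy(bonds, e_local=-1.0, e_nonlocal=-0.7, e_coop=-0.3):
    """Energy of a set of hydrogen bonds given as (i, j) residue index pairs."""
    bond_set = {(min(i, j), max(i, j)) for i, j in bonds}

    # No residue may take part in more than two hydrogen bonds.
    per_residue = {}
    for i, j in bond_set:
        for k in (i, j):
            per_residue[k] = per_residue.get(k, 0) + 1
    if any(count > 2 for count in per_residue.values()):
        raise ValueError("a residue forms more than two hydrogen bonds")

    energy = 0.0
    for i, j in bond_set:
        if j - i == 3:        # local bond: two residues in between
            energy += e_local
        elif j - i > 4:       # non-local bond: more than three in between
            energy += e_nonlocal
        else:
            raise ValueError("sequence separation not allowed for a bond")

    # Cooperative pairs: looking only at partners with index i + 1 counts
    # each unordered pair {(i, j), (i ± 1, j ± 1)} exactly once.
    for i, j in bond_set:
        for partner in ((i + 1, j + 1), (i + 1, j - 1)):
            if partner in bond_set:
                energy += e_coop
    return energy
```

Three consecutive local bonds, as in a short helical turn, give
3 × (-1) + 2 × (-0.3) = -3.6 units, while three consecutive non-local bonds in
a hairpin give 3 × (-0.7) + 2 × (-0.3) = -2.7 units.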

Geometrical constraints due to hydrogen bonding. Three non-collinear
consecutive atoms (i - 1, i, i + 1) of the chain define a plane. At atom i
(special care is needed to adapt these rules to atoms at the C- and N-termini),
one may define a tangent vector (along the direction joining the i - 1 and
i + 1 atoms) and a normal vector (along the direction joining the i-th atom
and the center of the circle passing through the three atoms), which together
define a plane. One then defines a binormal vector b_i perpendicular to the
plane, with the tangent, normal and binormal forming a right-handed local
coordinate system (Figure 1). This coordinate system defines the context
of an amino acid within a chain, a feature that plays a crucial role in the
tube picture. For hydrogen bond formation between atoms i and j, the
distance between these atoms ought to be between 4.7 Å and 5.6 Å (4.1 Å and
5.3 Å) for the local (non-local) case. A study of protein native state
structures reveals an overall nearly parallel alignment of the axes defined by
three vectors: the binormal vectors at i and j and the vector joining the i
and j atoms. A hydrogen bond is allowed to form only when the binormal axes
are constrained to be within 37° of each other, whereas the angle between the
binormal axes and that defined by the connecting vector joining atoms i and j
ought to be less than 20°. Additionally,
for the cooperative formation of non-local hydrogen bonds, one requires that
the corresponding binormal vectors of successive Cα atoms make an angle
greater than 90°. The first and the last residues of the chain are special
cases since their binormal vectors are not defined. In order for such residues
to form a hydrogen bond (with each other or with other internal residues
in the chain), it is required that the angle between the associated ending
peptide link and the connecting vector to the other residue participating in
the hydrogen bond is between 70° and 110°. As in real protein structures,
when helices are formed, they are constrained to be right-handed. This
constraint is enforced by requiring that the backbone chirality associated with
each local hydrogen bond is positive. The chirality is defined as the sign of
the scalar triple product (r_{i,i+1} × r_{i+1,i+2}) · r_{i+2,i+3}.
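
The local coordinate system and the orientation tests above can be sketched
in a few lines of Python. The helper names are hypothetical, and the cosine
thresholds 0.8 and 0.94 are the values listed in Table 1 (approximately
cos 37° and cos 20°). The sketch assumes equal bond lengths, so that the
in-plane vector perpendicular to the tangent points from atom i toward the
center of the circle; it does not treat the chain termini.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)


def local_frame(prev, cur, nxt):
    """Right-handed (tangent, normal, binormal) at `cur`; points not collinear."""
    t = unit(sub(nxt, prev))                        # along atoms i-1 -> i+1
    b = unit(cross(sub(cur, prev), sub(nxt, cur)))  # perpendicular to the plane
    n = cross(b, t)                                 # in-plane, so that t x n = b
    return t, n, b


def chirality(p0, p1, p2, p3):
    """Scalar triple product of three successive bond vectors.

    A positive value corresponds to a right-handed turn.
    """
    return dot(cross(sub(p1, p0), sub(p2, p1)), sub(p3, p2))


def hbond_orientation_ok(b_i, b_j, r_i, r_j, cos_bb=0.8, cos_bc=0.94):
    """Binormal-binormal and binormal-connecting-vector alignment tests."""
    c_ij = unit(sub(r_j, r_i))
    return (abs(dot(b_i, b_j)) > cos_bb
            and abs(dot(b_i, c_ij)) > cos_bc
            and abs(dot(b_j, c_ij)) > cos_bc)
```

Absolute values are used in the alignment tests because the criterion concerns
axes rather than directed vectors.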

Our approach for the derivation of the geometrical constraints imposed 
by hydrogen bonds is similar to that carried out at the level of an all-atom 
description of the protein chain. For the simpler Cα atom based description,
hydrogen bond energy functions have been introduced previously but
without any input from a statistical analysis of protein structures. 

Hydrophobic interactions. The hydrophobic (hydrophilic) effects mediated
by the water are captured through a relatively weak interaction, e_W,
(either attractive or repulsive) between Cα atoms which are within 7.5 Å of
each other (Figure 2c). Note that hydrogen bonds can easily be formed
between the amino acid residues in an extended conformation and the water
molecules. Within our model, the intrachain hydrogen bond interaction
introduces an effective attraction, because water molecules are not explicitly
present. The hydrophobicity scale is thus renormalized (e.g. even when e_W is
weakly positive, there could be an effective attraction resulting in structured
conformations such as a single helix or a planar sheet). A negative e_W is, in
any case, crucial for promoting the assembly of secondary motifs in native
tertiary arrangements. The properties of the model are summarized in Table
1.
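
Counting hydrophobic contacts for a homopolymer is a simple pairwise loop.
The sketch below assumes the contact rule stated in Table 1 and Figure 2
(atoms separated by more than two positions along the sequence and closer
than 7.5 Å); the function name and the returned pair of values are
illustrative choices.

```python
import math


def hydrophobic_energy(coords, e_W, cutoff=7.5, min_separation=3):
    """Return (number of contacts, total energy) for a homopolymer.

    A contact is any pair (i, j) with j >= i + min_separation whose
    distance is below `cutoff` (in Å); each contact contributes e_W.
    """
    n_contacts = 0
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                n_contacts += 1
    return n_contacts, e_W * n_contacts
```

For a heteropolymer, e_W would be replaced by a value that depends on the
pair of residues in contact.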

Results and Discussion 

Figure 3 shows the ground state phase diagram obtained from Monte-Carlo
computer simulations using the simulated annealing technique. (The
solvent-mediated energy, e_W, and the local radius of curvature energy
penalty, e_R, (see Methods for a description of the energy parameters) are
measured in units of the local hydrogen bond energy.) When e_W is sufficiently
repulsive (hydrophilic) (and e_R > 0.3 in the phase diagram), one obtains a
swollen phase with very few contacts between the Cα atoms. When e_W is
sufficiently attractive, one finds a very compact, globular phase with
featureless ground states with a high number of contacts.
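
The ground-state search can be pictured as a Metropolis simulated-annealing
loop. The version below is a generic illustration with a thermal weight
exp(-E/T) and a temperature lowered toward zero; the linear cooling schedule,
the function names and the caller-supplied `energy` and `propose` functions
are assumptions, and the chain moves and energy function of the actual model
are not reproduced here.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Metropolis rule: always accept downhill moves, uphill with exp(-dE/T)."""
    if delta_e <= 0.0:
        return True
    if temperature <= 0.0:
        return False  # at zero temperature only downhill moves survive
    return rng.random() < math.exp(-delta_e / temperature)


def simulated_annealing(state, energy, propose, n_steps, t_start, rng):
    """Anneal `state` while the temperature decreases linearly toward zero.

    `energy(state)` returns a number; `propose(state, rng)` returns a new
    trial state without modifying the old one.
    """
    e = energy(state)
    for step in range(n_steps):
        temperature = t_start * (1.0 - step / n_steps)
        candidate = propose(state, rng)
        e_new = energy(candidate)
        if metropolis_accept(e_new - e, temperature, rng):
            state, e = candidate, e_new
    return state, e
```

In a chain simulation, `propose` would apply a local or pivot-type move and
reject conformations that violate the hard geometrical constraints before the
energy is evaluated.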

Between these two phases (and in the vicinity of the swollen phase), a 
marginally compact phase emerges (the interactions barely stabilize the
ordered phase) with distinct structures including a single helix, a bundle of two
helices, a helix formed by β-strands, a β-hairpin, three-stranded β-sheets with
two distinct topologies and a β-barrel like conformation. Strikingly, these
structures are the stable ground states in different parts of the phase
diagram. Furthermore, other conformations, closely resembling distinct
supersecondary arrangements observed in proteins, such as the β-α-β motif, are
found to be competitive local minima, whose stability can be enhanced by
sequence design (for example, non-uniform values of curvature energy
penalties for single amino acids and hydrophobic interactions for amino acid pairs).
Figure 4 shows a compendium of various structures obtained in our studies 
including, for comparison, a generic compact conformation of a conventional 
polymer chain (with no tube geometry or hydrogen bonds), which is neither
made up of helices or sheets nor possesses the significant advantages of
protein structures. While there is a remarkable similarity between the struc- 
tures that we obtain and protein folds, our simplified coarse-grained model 
is not as accurate as an all-atom representation of the poly-peptide chain in 
capturing features such as the packing of amino acid side chains. 

The fact that different putative native structures are found to be com- 
peting minima for the same homopolymeric chain clearly establishes that 
the free-energy landscape of proteins is pre-sculpted by means of the few 
ingredients utilized in our model. At the same time, relatively small changes 
in the parameters e_W and e_R lead to significant differences in the emergent
ground state structure, underscoring the sensitive role played by chemical 
heterogeneity in selecting from the menu of native state folds. 

Figure 5a is a contour plot of the free energy at a temperature higher 
than the folding transition temperature (identified by the specific heat peak) 
for the parameter values e_W = -0.08 and e_R = 0.3 for which the ground
state is an α-helix (Figure 3). The free energy landscape has just one
minimum corresponding to the denatured phase whose typical conformations are
somewhat compact but featureless. The contour plot at the folding
transition temperature (Figure 5b) has three local minima corresponding to an
α-helix, a three-stranded β-sheet and the denatured state. At lower
temperatures, the α-helix is increasingly favored and the β-sheet is never the global
free energy minimum. Many protein folding experiments show that for small 
globular proteins, at the transition temperature, only two states (folded and 
unfolded) are populated. The appearance in the present framework of mul- 
tiple states for a homopolymer chain suggests that two state folders might 
have been evolutionarily selected by sequence design favoring the native state 
conformation over competing folds in the pre-sculpted landscape. 

Such a design is indeed straightforward within our model. For example, 
the α-β-α motif shown in Figure 4d (which is a local energy minimum for
a homopolymer) can be stabilized into a global energy minimum for the
sequence HPHHHPPPPHHPPHHPPPPHHHPP, with e_W = -0.4 for HH
contacts and e_W = 0 for other contacts, and e_R = 0.3 for all residues.

It is interesting to note that lattice models of compact homopolymers
yield large amounts of secondary structure; local radius of curvature
constraints are built into lattice models. However, an all-atom study of
polyalanine has shown that compactness alone is insufficient to obtain secondary
structure. Even a simple tube subject to an attractive self-interaction
favoring compaction has a tendency to form helices, hairpins and sheets when
the ratio of the tube thickness to the range of attractive interaction is tuned
properly. Our work here underscores the importance of hydrogen bonds
in stabilizing both helices and sheets simultaneously (without any need for 
adjustment of the tube thickness) allowing the formation of tertiary arrange- 
ments of secondary motifs. Indeed, the fine-tuning of the hydrogen bond 
and the hydrophobic interaction is of paramount importance in the selec- 
tion of the marginally compact region of the phase diagram in which protein 
native folds are found. It is also important to note that proteins are rela- 
tively short chain molecules compared to conventional polymers. These are 
special features of proteins, which distinguish them from generic compact
polymers. 

A free energy landscape with a thousand or so minima [7] with correspondingly
large basins of attraction leads to stability and diversity, the dual
characteristics needed for evolution to be successful. Proteins are those
sequences which fit well into one of these minima and are relatively stable.
Yet, the fact
that the marginally compact phase lies in the vicinity of a phase transition to 
the swollen phase allows for an exquisite sensitivity of protein structures to 
the right types of perturbations. Thus a change in the external environment 
(e.g. an ATP molecule binding to the protein) could reshape the free energy 
landscape allowing for a different, stable and easily foldable conformation. 

In summary, within a simple, yet realistic, framework, we have shown that 
protein native-state structures can arise from considerations of symmetry and 
geometry associated with the polypeptide chain. The sculpting of the free 
energy landscape with relatively few broad minima is consistent with the 
fact that proteins can be designed to enable rapid folding to their native 
states. The limited number of folds arises from the geometrical constraints 
imposed by sterics and hydrogen bonds. In the marginally compact phase, 
not only does one have a space-filling conformation (the nearby backbone 
segments have to be placed near each other in order to avail of the attractive 
potential), which is effective in expelling water from the hydrophobic core, 
but also these segments need to have the right orientation with respect to 
each other in order to respect the geometrical constraints imposed by the 
hydrogen bonds. 
Table 1. Properties of the model

Parameter                                  Constraint
-----------------------------------------  ------------------------------------------------
Tube approximation (a)                     R_ijk > 2.5 Å, for all i < j < k
  local radius of curvature (b)            2.5 Å < R_{i-1,i,i+1} < 7.9 Å, for all 1 < i < N
  self avoidance                           r_ij > 4 Å, for all i < j - 1
  amino acid specific?                     no

Local hydrogen bond (c)                    j = i + 3
  Cα-Cα distance                           4.7 Å < r_ij < 5.6 Å
  binormal-binormal correlation (d)        |b_i · b_j| > 0.8
  binormal-connecting vector (d,e,f)       |b_i · c_ij| > 0.94, |b_j · c_ij| > 0.94
  chirality (d)                            (r_{i,i+1} × r_{i+1,i+2}) · r_{i+2,i+3} > 0
  energy                                   -1
  amino acid specific?                     no

Non-local hydrogen bond (c)                j > i + 4
  Cα-Cα distance                           4.1 Å < r_ij < 5.3 Å
  binormal-binormal correlation (d)        |b_i · b_j| > 0.8
  binormal-connecting vector (d,e,f)       |b_i · c_ij| > 0.94, |b_j · c_ij| > 0.94
  energy                                   -0.7
  amino acid specific?                     no

Cooperative hydrogen bonds                 between (i, j) and (i ± 1, j ± 1)
  β-sheet zig-zag pattern (d,g)            b_i · b_{i±1} < 0, b_j · b_{j±1} < 0
  energy per pair                          -0.3
  amino acid specific?                     no

Bending rigidity                           R_{i-1,i,i+1} < 3.2 Å
  energy                                   e_R
  amino acid specific?                     yes (for a heteropolymer)

Hydrophobic contact                        j > i + 2
  Cα-Cα distance                           r_ij < 7.5 Å
  energy                                   e_W
  amino acid specific?                     yes (for a heteropolymer)

(a) R_ijk is the radius of a circle drawn through the Cα positions of i, j and k
(b) N is the number of residues
(c) each residue is constrained to form no more than 2 hydrogen bonds (except the
residues located at the chain termini which form at most 1 hydrogen bond)
(d) applied only when the corresponding binormal vectors exist
(e) for i = 1 and (or) j = N this is replaced by the constraint that the connecting
vector is making an angle between 70° and 110° with the extremal peptide links.
(f) the connecting vector, c_ij = r_ij/|r_ij|, is a unit vector joining i and j
(g) applied when at least one of the two cooperative hydrogen bonds is non-local



Table Legends 
Table 1 

Summary of all geometrical and energetic parameters involved in the model
definition. All geometrical properties have been derived via a thorough
analysis of PDB native structures.
Figure Legends 



Figure 1 

Sketch of the local coordinate system. For each Cα atom i (except the first
and the last one), the axes of a right-handed local coordinate system are
defined as follows. The tangent vector t_i is parallel to the segment joining
i - 1 with i + 1. The normal vector n_i joins i to the center of the circle
passing through i - 1, i, and i + 1 and it is perpendicular to t_i. t_i and n_i
along with the three contiguous Cα atoms lie in a plane shown in the figure.
The binormal vector b_i is perpendicular to this plane. The vectors t_i, n_i, b_i
are normalized to unit length.

Figure 2 

Sketch of a portion of a protein chain. The black spheres represent the 
Cα atoms of the amino acids. The local radius of curvature r is defined
as the radius of the circle passing through three consecutive atoms and is
constrained to lie between 2.5 Å and 7.9 Å (r_max). A penalty e_R is imposed
when 2.5 Å < r < 3.2 Å (see (b)). The hydrophobic interaction, e_W, is operative
when two atoms separated by more than two along the sequence are within
7.5 Å of each other (see (c)). Note that two non-adjacent atoms cannot be
closer than 4 Å. A flexible tube is characterized by the constraint that none
of the three-body radii is less than the tube thickness, chosen here to be 2.5 Å
(see (b) and (d)).

Figure 3 

Phase diagram of ground state conformations. 

The ground state conformations were obtained using Monte-Carlo
simulations of chains of 24 Cα atoms. e_R and e_W denote the local radius of
curvature energy penalty and the solvent-mediated interaction energy respectively.
Over 600 distinct local minima were obtained in different parts of parameter
space starting from a random conformation and successively distorting the
chain with pivot and crankshaft moves commonly used in stochastic chain
dynamics [43]. A Metropolis Monte-Carlo procedure is employed with a thermal
weight exp(-E/T), where E is the energy of the conformation and the
temperature T is set initially at a high value and then decreased gradually
to zero. In the orange phase, the ground state is a 2-stranded β-hairpin.
Two distinct topologies of a 3-stranded β-sheet (dark and light blue phases)
are found, corresponding to conformations (i) and (j) in
Fig. 4 respectively. The helix bundle shown in conformation (b) in Fig. 4 is
the ground state in the green phase whereas the ground state conformation
in the magenta phase has a slightly different arrangement of helices. The
white region in the left of the phase diagram has large attractive values of
e_W and the ground state conformations are compact globular structures with
a crystalline order induced by hard sphere packing considerations and not
by hydrogen bonding (conformation (l) in Fig. 4).

Figure 4 

MolScript representation of the most common structures obtained in our 
simulations. 

Helices and strands are assigned when local or non-local hydrogen bonds 
are formed according to the described rules. Conformations (a), (b), (h), (i), 
(j), and (k) are the stable ground states in different parts of the parameter 
space shown in Figure 3. Conformations (c), (d), (e), (f), and (g) are
competitive local minima. Conformation (l) is that of a generic compact polymer
chain, obtained by switching off hydrogen bonds, the tube constraint and 
curvature energy penalty and is obtained on maximizing the total number of 
hydrophobic contacts. 

Figure 5 

Contour plots of the effective free energy at high temperature (T = 0.22)
and at the folding transition temperature T_f = 0.2.

The effective free energy, defined as
F(N_l + N_nl, N_W) = -ln P(N_l + N_nl, N_W),
is obtained as a function of the total number of hydrogen bonds
N_l + N_nl and the total number of hydrophobic contacts N_W from the
histogram P(N_l + N_nl, N_W) collected in equilibrium Monte-Carlo simulations
at constant temperature. The spacing between consecutive levels in each
contour plot is 1 and corresponds to a free energy difference of k_B T, where
T is the temperature in physical units. The darker the color, the lower the
free energy value. There is just one free energy minimum corresponding to
the denatured state at a temperature higher than the folding transition
temperature (Panel (a)) whereas one can discern the existence of three distinct
minima at the folding transition temperature (Panel (b)). Typical
conformations from each of the minima are shown in the figure.
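
The effective free energy of Figure 5 is the negative logarithm of a
normalized histogram. A minimal Python sketch of that step is shown below; it
assumes the equilibrium samples are available as a list of (number of
hydrogen bonds, number of hydrophobic contacts) pairs, and the function name
is hypothetical.

```python
import math
from collections import Counter


def effective_free_energy(samples):
    """Map each sampled (n_hbonds, n_contacts) bin to F = -ln P.

    `samples` is an iterable of (n_hbonds, n_contacts) pairs collected at
    constant temperature. F is in units of the thermal energy, so a
    difference of 1 between two bins corresponds to one contour level.
    Bins that were never visited are absent from the result.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    return {state: -math.log(n / total) for state, n in counts.items()}
```

Only free energy differences between bins are meaningful, so the overall
normalization of the histogram shifts every value by the same constant.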